Detect orphaned KG certs via public key modulus comparison#6020
Detect orphaned KG certs via public key modulus comparison#6020Robbie-Microsoft wants to merge 8 commits into
Conversation
After a VM or app restart, KeyGuard regenerates the per-boot private key in the same CNG container while the old cert remains on disk. The cert appears valid (time-wise, HasPrivateKey=true) but the key material has changed. Fix: compare the cert's embedded public key modulus against the public key currently exported from the CNG container. A mismatch conclusively means the container was regenerated and the cert is orphaned. The check runs inside WindowsPersistentCertificateCache.Read during candidate selection, so the loop continues looking for a valid cert rather than returning an unusable one. Non-CNG keys (software CSP) are accepted on faith since KeyGuard-style regeneration does not apply to them. CryptographicException during key load is treated as orphaned (key inaccessible = unusable). The CNG container modified-time heuristic (Check 3 from the original proposal) is intentionally omitted. Both it and the modulus comparison detect the same event, but the modulus check is definitive: independently generated RSA keys sharing a modulus is computationally infeasible. The modified-time check is a heuristic with a known false-negative window and adds no additional coverage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds a Windows KeyGuard-specific “orphaned cert” detection to MSAL’s persistent mTLS certificate selection logic by validating that the certificate’s embedded RSA public key matches the current RSA public key in the associated CNG container. This prevents MSAL from reusing a persisted cert whose key container was regenerated after reboot, avoiding repeated SCHANNEL handshake failures.
Changes:
- Add modulus/exponent comparison helpers (
IsCertKeyOrphaned,PublicKeyMatchesCert) to detect CNG-container/key mismatches. - Update
WindowsPersistentCertificateCache.Readto skip orphaned candidates and continue searching for a usable cert. - Add unit tests covering the new helper methods.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/Microsoft.Identity.Test.Unit/ManagedIdentityTests/PersistentCertificateStoreUnitTests.cs | Adds tests for orphan detection and RSA public key matching logic. |
| src/client/Microsoft.Identity.Client/ManagedIdentity/V2/WindowsPersistentCertificateCache.cs | Skips persisted cert candidates whose CNG container key no longer matches the cert’s embedded public key. |
| src/client/Microsoft.Identity.Client/ManagedIdentity/V2/MtlsCertificateCache.cs | Introduces orphan-detection helpers based on RSA modulus/exponent comparison with logging on crypto failures. |
Comments suppressed due to low confidence (1)
tests/Microsoft.Identity.Test.Unit/ManagedIdentityTests/PersistentCertificateStoreUnitTests.cs:912
- Same issue as the KeyMatches test: on
net48,RSA.Create(2048)commonly returnsRSACryptoServiceProvider, sokey2 as RSACngwill be null and the test will fail. Createkey2as anRSACngexplicitly (or gate the test onkey2 is RSACng) to keep the multi-targeted test suite passing.
using var key1 = RSA.Create(2048);
using var key2 = RSA.Create(2048);
var rsaCng2 = key2 as RSACng;
Assert.IsNotNull(rsaCng2, "Expected RSACng on Windows.");
RSA.Create(2048) does not return RSACng on ARM64 Windows (.NET 8). Use new RSACng(2048) explicitly so the cast is always valid. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When IsCertKeyOrphaned detects a cert whose CNG container key no longer matches the cert's public key (post-reboot KeyGuard regeneration), the cert is now removed from the Windows cert store instead of just skipped. Previously the orphaned cert stayed in the store indefinitely, causing a redundant modulus check on every Read call. Changes: - WindowsPersistentCertificateCache.Read: collect orphaned cert thumbprints during the scan loop; after the loop, open ReadWrite store and remove them (best-effort, wrapped in try/catch) - Added internal Read overload accepting Func<X509Certificate2, ILoggerAdapter, bool> isOrphaned delegate for testability; IPersistentCertificateCache interface unchanged - Added unit test: Read_RemovesOrphanedCert_FromStore Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GetRSAPrivateKey() returns null for both non-RSA certs (ECDSA, DSA) and for RSA certs with an inaccessible private key. Previously both fell into the 'accept on faith' branch, which could allow a useless RSA cert to pass the orphan check. Fix: check GetRSAPublicKey() to distinguish RSA certs from non-RSA certs. An RSA cert with no accessible private key is treated as orphaned. Non-RSA certs continue to be accepted on faith (KG regeneration does not apply). Add unit test: IsCertKeyOrphaned_ReturnsTrue_For_RsaCert_WithNoPrivateKey Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/client/Microsoft.Identity.Client/ManagedIdentity/V2/WindowsPersistentCertificateCache.cs:190
RemoveByThumbprintsusesList<string>+thumbprints.Contains(...)inside a loop over all certificates, making removal O(storeSize * orphanCount). Since this can run on the read path after a reboot, consider using aHashSet<string>(e.g., OrdinalIgnoreCase) for thumbprints (and/orX509Certificate2Collection.Findby thumbprint) to keep the removal cost predictable.
int removed = 0;
foreach (var cert in items)
{
try
{
if (thumbprints.Contains(cert.Thumbprint))
{
store.Remove(cert);
removed++;
…ock orphan removal - Rename internal Read overload to public TryRead with out param last (public Read delegates to TryRead with the default IsCertKeyOrphaned check) - Extract SnapshotCertificates(store, logger) helper to eliminate 5 duplicate try/catch CopyTo fallback blocks across Read, Write, DeleteAllForAlias, PruneExpiredForAlias, and RemoveByThumbprints - Wrap RemoveByThumbprints in InterprocessLock.TryWithAliasLock (300ms, best-effort) — same pattern as Write/Delete/DeleteAllForAlias to prevent concurrent store writes during orphan cleanup - Update orphan removal test to call TryRead with the new signature Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Two suggestions to address before merge so #6031 closes here in one PR:
-
Add a CanSign liveness probe right after CngKey.Open in WindowsCngKeyOperations.TryGetOrCreateKeyGuard - 1-byte RSA-SHA256 sign. Catches the zombie-VBS case where the container's public metadata survives a reboot but the private material is dead. In that state ExportParameters(false) still returns the old modulus, so IsCertKeyOrphaned returns false and we'd fall through to a failed mTLS handshake.
-
Add a Broad issuer-CN sweep of CurrentUser\My when a fresh KG key is minted - when the container was just regenerated, every cert under CN=managedidentitysnissuer.login.microsoft.com is orphaned by definition. One-shot cleanup at the cause site, no per-Read discovery cost, handles multi-identity hosts uniformly. Your per-Read removal still covers non-reboot drift.
I validated both E2E on a Server 2022 KeyGuard VM, and it works great.
|
For reference, I pushed both additions as a draft in #6037 with the full implementation and unit test, feel free to cherry-pick from there or I can re-spin them as commits on this branch, whichever is easier. |
…TLS certs on reboot Fixes the post-reboot recovery path for IMDSv2 mTLS PoP token acquisition. On Azure VM restart the per-boot KeyGuard key (NCryptUsePerBootKeyFlag) is reaped by VBS, but the persisted binding cert under CN=managedidentitysnissuer.login.microsoft.com still references the old public key. The next call then either burns a failed TLS handshake before the reactive SChannel catch kicks in, or — in the zombie-handle variant — falls through entirely because the cert's modulus still matches the dead container. Changes ------- - Add CanSign liveness probe right after CngKey.Open in WindowsCngKeyOperations.TryGetOrCreateKeyGuard. 1-byte RSA-SHA256 PKCS1 sign; ~1-3ms, runs once per process (result is cached in WindowsManagedIdentityKeyProvider._cachedKey). Catches zombie-VBS state where Open succeeds but private material is dead. - Add PurgeManagedIdentityCertificates: one-shot issuer-CN substring sweep of CurrentUser\My, invoked at the moment a fresh KeyGuard key is minted (both the probe-failed path and the Open-threw path). Removes orphaned binding certs at the cause site so the next request doesn't pay any per-Read discovery cost and multi-identity hosts (SAMI + UAMIs sharing the KeyGuard container) are cleaned up uniformly. - Add 4 Windows-only unit tests for the purge filter behavior (matching, non-matching, case-insensitive, only-removes-matching). The reactive SChannel catch in ImdsV2ManagedIdentitySource is retained as a defensive backstop. Validation ---------- Validated E2E on a Server 2022 KeyGuard VM across multiple reboots and mixed SAMI/UAMI cases. Canonical post-reboot first call: - CngKey.Open threw CryptographicException HR=0x8009003A - Fresh KeyGuard key created - PurgeManagedIdentityCertificates removed orphan cert (Inspected=4) - MAA attestation OK - POST /issuecredential -> 200 - mTLS handshake -> 200 on first try (no reactive catch invoked) - Total ~2.8s on cold start Full unit suite green on net8.0: 2069 passed, 0 failed, 19 skipped. Refs #6031. Complementary to #6020 (cert-side modulus comparison): this PR adds the key-side liveness probe and broad issuer-CN sweep at the mint site. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cherry-picked 1144df9 into this PR as 8b0b9a8. Your CanSign probe covers the zombie-VBS case my modulus check misses, and the issuer-CN sweep handles multi-identity hosts cleanly |
- RemoveByThumbprints: use HashSet<string>(OrdinalIgnoreCase) for thumbprint lookup (defensive; .NET's Thumbprint getter always returns uppercase, but normalize anyway). - PurgeManagedIdentityCertificates: skip null entries in the snapshot consuming loop (defensive against TOCTOU between Certificates.Count and enumeration). - WindowsPersistentCertificateCache.TryRead (testability overload): throw ArgumentNullException when the injected isOrphaned delegate is null. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| // Snapshot to avoid 'collection modified during enumeration' provider quirks. | ||
| var snapshot = new X509Certificate2[store.Certificates.Count]; | ||
| try | ||
| { | ||
| store.Certificates.CopyTo(snapshot, 0); | ||
| } | ||
| catch (Exception copyEx) | ||
| { | ||
| logger?.Info(() => | ||
| $"[MI][WinKeyProvider] PurgeManagedIdentityCertificates: store snapshot via CopyTo failed " + | ||
| $"({copyEx.GetType().Name}: {copyEx.Message}). Falling back to enumeration."); | ||
|
|
||
| int i = 0; | ||
| snapshot = new X509Certificate2[store.Certificates.Count]; | ||
| foreach (X509Certificate2 c in store.Certificates) | ||
| { | ||
| snapshot[i++] = c; | ||
| } | ||
| } |
Problem
After a VM or app restart, KeyGuard regenerates the per-boot private key in the same CNG container while old binding certs remain on disk under
CN=managedidentitysnissuer.login.microsoft.com. The certs appear valid (time-wise,HasPrivateKey=true) but the key material has changed (or, in the zombie-VBS case, the container metadata survives but the private material is dead). MSAL loads such certs from the persistent store, attempts an mTLS handshake, and fails.This PR addresses three distinct failure modes:
CngKey.Openthrows /GetRSAPrivateKey()returns nullCreateFreshhasn't run yet in this processCreateFreshran (withOverwriteExistingKey); cert references previous key materialFix — three complementary layers
1. Per-
Readmodulus comparison (WindowsPersistentCertificateCache)Compare the cert's embedded public key (modulus + exponent) against the public key currently in the CNG container. A mismatch conclusively means the container was regenerated — the cert is orphaned. Runs inside
WindowsPersistentCertificateCache.Readduring candidate selection. Orphaned certs are removed from the store on detection so they don't accumulate and don't require the modulus check on every subsequentRead.Covers: Mode 2 (and non-reboot drift scenarios).
2.
CanSignliveness probe (WindowsCngKeyOperations.TryGetOrCreateKeyGuard)A 1-byte RSA-SHA256 PKCS1 sign right after
CngKey.Open. ~1–3 ms, runs once per process (the result is cached inWindowsManagedIdentityKeyProvider._cachedKey). If the probe fails, the container is treated as dead andCreateFreshis invoked.Covers: Mode 3. Without this,
ExportParameters(false)returns the old modulus from intact metadata, the per-Readmodulus check passes, and the failure surfaces only at the mTLS handshake.3. Issuer-CN sweep on fresh-key mint (
PurgeManagedIdentityCertificates)One-shot substring sweep of
CurrentUser\MyforCN=managedidentitysnissuer.login.microsoft.com, invoked at the moment a fresh KeyGuard key is minted (both the probe-failed path and theOpen-threw path). Removes orphan binding certs at the cause site.Covers: Modes 1 and 2 at the cause site — multi-identity hosts (SAMI + UAMIs sharing the KeyGuard container) are cleaned up uniformly, no per-
Readdiscovery cost.The reactive SChannel catch in
ImdsV2ManagedIdentitySourceis retained as a defensive backstop.Design notes
GetRSAPrivateKey()returning null for an RSA cert is treated as orphaned (inaccessible = unusable); non-RSA certs (ECDSA, etc.) are accepted on faithCryptographicExceptionduring key load is treated as orphanedTryReadoverload accepts an injectableisOrphaneddelegate for testability;IPersistentCertificateCacheinterface is unchangedRemoveByThumbprintsis wrapped inInterprocessLock.TryWithAliasLock(300 ms best-effort) — same pattern asWrite/Delete/DeleteAllForAliasSnapshotCertificateshelperTesting
Unit tests (Windows-only where RSACng is required):
PersistentCertificateStoreUnitTests.cs— 6 tests:IsCertKeyOrphaned_ReturnsTrue_For_NullCertIsCertKeyOrphaned_ReturnsFalse_For_ValidCertIsCertKeyOrphaned_ReturnsTrue_For_RsaCert_WithNoPrivateKeyPublicKeyMatchesCert_ReturnsTrue_When_KeyMatchesCertPublicKeyMatchesCert_ReturnsFalse_When_ModulusMismatch(different RSACng simulates post-reboot key replacement)Read_RemovesOrphanedCert_FromStore(verifies cert is deleted from store andReadreturns false)WindowsCngKeyOperationsPurgeUnitTests.cs— 4 tests for the purge filter:Full unit suite: 2068 passed, 0 failed, 19 skipped (net8.0).
E2E validation (Gladwin, Server 2022 KeyGuard VM, multiple reboots, mixed SAMI/UAMI):
CngKey.OpenthrewCryptographicExceptionHR=0x8009003APurgeManagedIdentityCertificatesremoved orphan cert (Inspected=4)POST /issuecredential→ 200Credits
Modulus-comparison layer authored by @Robbie-Microsoft. CanSign probe + issuer-CN sweep authored by @gladjohn (cherry-picked from #6037). Approach discussed with @bgavrilMS, @Raghavendra-ms, @dragosav.
Companion to #6017 (token_not_after OID eviction — different failure mode). Refs #6031.